11/18/2024
Recall our MLR Model assumptions: \( E(Y) = \mathbf{X}\boldsymbol\beta \), and \( \epsilon_i \) are all independent with mean 0 and variance \( \sigma^2 \).
Frequently, we encounter data sets with additional structure, e.g., repeated measurements of the same individuals over time, or individuals nested within groups.
Think of such data as having one or more levels of group structure, with variation in the within-group regression parameters.
Mathematically, we might posit a regression model for individual \( i \) (who belongs to group \( j \)) as \[ Y_{i} = \beta_0+\beta_{0j} + \sum_{k=1}^p(\beta_k+\beta_{kj})X_{ki}+\epsilon_i \]
Group \( j \) regression parameters are the sum of the population-level estimate and group-level estimated deviation: \( \beta_{k}+\beta_{kj} \).
Notice we haven't made any assumptions yet about whether any relationships exist (or not) between these group-level deviations from the population level parameters. We'll explore some options shortly.
These data (from the cv package) track the weight of 48 individual pigs over 9 weeks (weight in kilograms).
library(cv) # Cross Validation package
head(Pigs)
id week weight
1 1 1 24.0
2 1 2 32.0
3 1 3 39.0
4 1 4 42.5
5 1 5 48.0
6 1 6 54.5
summary(Pigs)
id week weight
Min. : 1.00 Min. :1 Min. :20.00
1st Qu.:12.75 1st Qu.:3 1st Qu.:37.00
Median :24.50 Median :5 Median :50.50
Mean :24.50 Mean :5 Mean :50.41
3rd Qu.:36.25 3rd Qu.:7 3rd Qu.:63.50
Max. :48.00 Max. :9 Max. :88.00
Source: Data were provided by Dr. Philip McCloud of Monash University, Melbourne, to the authors of the book Analysis of longitudinal data, by Diggle, Liang, & Zeger (1994). https://doi.org/10.1093/oso/9780198524847.001.0001.
plot(weight ~ week, data = Pigs, type = "n")
for (i in 1:max(Pigs$id)) { # one growth trajectory per pig
lines(1:9, Pigs$weight[Pigs$id == i], type="o", pch=19, col=rainbow(48)[i])
}
abline(lm(weight ~ week, data = Pigs), lwd = 4)
legend('topleft',"Overall best fit line", lty=1, lwd=4)
Q: Do the individual pigs' growth curves appear to have the same slopes and intercepts?
Goal: Quantify the variability in regression parameters at different levels of the data hierarchy.
Option 1: One (unwise) approach to modelling data like these is to just ignore the individual differences and pretend all individuals are independently and identically distributed.
Option 2: Another (unwise) approach is to estimate separate regression parameters for each pig (e.g., using interaction terms; see the sketch below).
Hierarchical models, like linear mixed effects models, try to strike a balance between these two options.
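To make Option 2 concrete, here is a minimal sketch using the Pigs data from above (fit_sep is an illustrative name, not from the slides):
# Option 2: a separate intercept and slope for every pig, via interaction terms
fit_sep = lm(weight ~ week * factor(id), data = Pigs)
length(coef(fit_sep)) # 96 free parameters: an intercept and a slope for each of 48 pigs
This fits, but each pig's line is estimated from only its own 9 observations, and nothing ties the pigs together.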
Recall our tentative model from a few slides back, for individual \( i \) in group \( j \): \[ Y_{i} = \beta_0+\beta_{0j} + \sum_{k=1}^p(\beta_k+\beta_{kj})X_{ki}+\epsilon_i \] LME Assumption: the group-level parameter deviations \( \beta_{kj} \) are randomly sampled from a Normal(\( 0 \), \( \sigma_{\beta_k} \)) distribution; equivalently, the group-level parameters \( \beta_k+\beta_{kj} \) are randomly sampled from a Normal(\( \beta_k \), \( \sigma_{\beta_k} \)) distribution.
Some important differences between this approach and option #2 above: the group-level parameters are assumed to come from a common distribution, so each group's estimates get pulled (shrunk) toward the population-level estimates, and we estimate just a mean and a standard deviation for each parameter rather than a separate free parameter for every group.
Random effects (intercept only) model with no predictors, where we'll deviate from the notation used in Zuur et al. (2009) and index individuals as the \( i^{th} \) individual in group \( j \):
\[ Y_{ij} = \beta_0 + \beta_{0j} + \epsilon_{ij} \]
Assumptions: the \( \beta_{0j} \) are i.i.d. Normal(\( 0 \), \( \sigma_{\beta_0} \)), the \( \epsilon_{ij} \) are i.i.d. Normal(\( 0 \), \( \sigma \)), and the two are independent of one another.
Random intercept model: mixed effect for the intercept, fixed effect for one predictor \( X \) (just one, for simplicity):
\[ Y_{ij} = \beta_0+\beta_{0j} + \beta_1 \, X_{ij} + \epsilon_{ij} \]
Random intercept and slope model with mixed effects for both slope and intercept:
\[ Y_{ij} = \beta_0+\beta_{0j} + (\beta_1+\beta_{1j})X_{ij}+\epsilon_{ij} \]
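To connect these three forms to lme4 syntax (introduced below), here is a sketch fitting each to the Pigs data from above; the names m0, m1, m2 are illustrative:
library(lme4)
m0 = lmer(weight ~ 1 + (1 | id), data = Pigs) # random intercept, no predictors
m1 = lmer(weight ~ week + (1 | id), data = Pigs) # random intercept, fixed slope
m2 = lmer(weight ~ week + (week | id), data = Pigs) # random intercept and random slope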
Let's simulate some data and explore the use of the tools in the lme4 package.
# Simulate data with a random effect for the slope on predictor X
set.seed(757)
Ngroups = 7; N=20 # per group
b0=50 # fixed, not random, intercept;
mean.b1=2.5 # random slope
sd.b1=1
b1=sort(rnorm(Ngroups, mean.b1, sd.b1))
b1 # random slopes to use for each Group
[1] 1.595831 1.729731 1.900448 2.034316 2.285474 2.700080 3.802416
X=runif(N*Ngroups,1,100)
Y=rnorm(length(X), b0 + rep(b1,each=N)*X, sd=16)
dat=data.frame(Y=Y,X=X,
Group=factor(rep(1:Ngroups,each=N)))
head(dat,2)
Y X Group
1 70.32508 24.96940 1
2 146.51905 65.44526 1
plot(X,Y, col=rep(rainbow(Ngroups), each=N), pch=20) # Color by group
abline(lm(Y~X), lty=1, lwd=2, col="black") # Best fit line ignoring groups
# best fit line within each group
for(i in 1:length(b1)) {
abline(lm(Y~X, data=dat[dat$Group==i,]), lty=2, col=rainbow(Ngroups)[i])
}
By default, lmer() estimates parameters via REML, which gives less biased estimates of the variance components. Use REML=FALSE (i.e., maximum likelihood) to fit models for comparison via AIC, likelihood ratio tests, etc.
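For instance, here is a minimal sketch of that workflow applied to the simulated data above (the model names are illustrative):
library(lme4)
m.a = lmer(Y ~ X + (1 | Group), REML=FALSE, data=dat) # random intercept only
m.b = lmer(Y ~ X + (X | Group), REML=FALSE, data=dat) # random intercept and slope
AIC(m.a, m.b) # ML fits are comparable via AIC
anova(m.a, m.b) # likelihood ratio test (anova() refits REML fits with ML automatically)
m.final = update(m.b, REML=TRUE) # refit the chosen model with REML for reporting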
library(lme4) # Use ML when doing model comparison or variable selection! Otherwise, REML.
# Best fit lines for each group using a full mixed effects model
rfit = lmer(Y ~ X + (X | Group), REML=TRUE, data=dat); rfit
Linear mixed model fit by REML ['lmerMod']
Formula: Y ~ X + (X | Group)
Data: dat
REML criterion at convergence: 1200.681
Random effects:
Groups Name Std.Dev. Corr
Group (Intercept) 1.0244
X 0.7363 1.00
Residual 15.8266
Number of obs: 140, groups: Group, 7
Fixed Effects:
(Intercept) X
49.432 2.282
optimizer (nloptwrap) convergence code: 0 (OK) ; 0 optimizer warnings; 1 lme4 warnings
Note that the (X | Group) term gives a random effect for both the slope and the intercept! Here the estimated correlation between the two random effects is 1.00, a hint that this structure is richer than the data support (recall the simulation used a fixed intercept), which motivates the random-slope-only model below.
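The usual lme4 extractor functions pull out the separate pieces of the fit; a quick sketch:
fixef(rfit) # population-level (fixed) estimates of the intercept and slope
ranef(rfit) # predicted group-level deviations (the beta_0j and beta_1j)
VarCorr(rfit) # estimated random-effect standard deviations and their correlation
coef(rfit) # population-level plus group-level parts combined, one row per group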
library(lmerTest); # lmerTest adds hypothesis test info in the summary outputs
rfit1 = lmer(Y ~ X + (X-1 | Group), data=dat); summary(rfit1) # (X-1 | Group): random slope only, no random intercept; run library(lmerTest) first for p-values
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Y ~ X + (X - 1 | Group)
Data: dat
REML criterion at convergence: 1200.8
Scaled residuals:
Min 1Q Median 3Q Max
-2.81692 -0.60154 0.04225 0.67061 2.55511
Random effects:
Groups Name Variance Std.Dev.
Group X 0.5631 0.7504
Residual 250.6692 15.8325
Number of obs: 140, groups: Group, 7
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 49.2571 2.7496 132.0413 17.914 < 2e-16 ***
X 2.2844 0.2875 6.2696 7.945 0.000169 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
X -0.143
We can run GLMs as you might expect, using glmer instead of glm. With family=gaussian, this should be equivalent to the previous code.
grfit = glmer(Y ~ X + (X-1 | Group), data=dat, family=gaussian); summary(grfit)
Linear mixed model fit by REML ['lmerMod']
Formula: Y ~ X + (X - 1 | Group)
Data: dat
REML criterion at convergence: 1200.8
Scaled residuals:
Min 1Q Median 3Q Max
-2.81692 -0.60154 0.04225 0.67061 2.55511
Random effects:
Groups Name Variance Std.Dev.
Group X 0.5631 0.7504
Residual 250.6692 15.8325
Number of obs: 140, groups: Group, 7
Fixed effects:
Estimate Std. Error t value
(Intercept) 49.2571 2.7496 17.914
X 2.2844 0.2875 7.945
Correlation of Fixed Effects:
(Intr)
X -0.143
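The real payoff of glmer() is non-Gaussian responses. As a minimal sketch (the data and all names here are simulated/illustrative, not from the slides), a Poisson mixed model with a random intercept for each group:
library(lme4)
set.seed(42)
simdat = data.frame(x = runif(300), g = factor(rep(1:10, each=30)))
b0j = rnorm(10, 0, 0.5) # group-level intercept deviations
simdat$y = rpois(300, exp(0.5 + b0j[simdat$g] + 1.0*simdat$x)) # Poisson counts
pois.fit = glmer(y ~ x + (1 | g), family=poisson, data=simdat)
summary(pois.fit)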
isSingular(rfit1) # See ?isSingular for details. FALSE is good. :)
[1] FALSE
# Here are the best fit parameters for each group...
coefficients(rfit1)$Group
(Intercept) X
1 49.25711 1.630928
2 49.25711 1.736566
3 49.25711 1.953450
4 49.25711 2.046965
5 49.25711 2.202513
6 49.25711 2.600950
7 49.25711 3.819426
# How do these compare to the data simulation (true) values?
cbind(True.Slopes=b1, Est.Slopes=coefficients(rfit1)$Group[,2])
True.Slopes Est.Slopes
[1,] 1.595831 1.630928
[2,] 1.729731 1.736566
[3,] 1.900448 1.953450
[4,] 2.034316 2.046965
[5,] 2.285474 2.202513
[6,] 2.700080 2.600950
[7,] 3.802416 3.819426
plot(b1, coefficients(rfit1)$Group[,2], xlab="True Slopes for Groups",
ylab="Estimated Slopes for Groups");
abline(a=0, b=1) # intercept 0, slope 1: the line of perfect agreement
# Best fit line within each group: True vs estimated
plot(X,Y, col=rep(rainbow(Ngroups), each=N), pch=20)
abline(lm(Y~X), lty=3, lwd=5, col="black")
for(i in 1:length(b1)) {
abline(b0, b1[i], lty=2, lwd=2, col=rainbow(Ngroups)[i]);
abline(unlist(coefficients(rfit1)$Group[i,]), lty=1, lwd=2, col=rainbow(Ngroups)[i]);
}
legend('topleft',c("True","Estimated","SLR"), lty=c(2,1,3),lwd=c(2,2,5))
Pfit = lmer(weight ~ week + (week | id), data=Pigs); summary(Pfit) # library(lmerTest) to get p-vals
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: weight ~ week + (week | id)
Data: Pigs
REML criterion at convergence: 1740.9
Scaled residuals:
Min 1Q Median 3Q Max
-3.6202 -0.5474 0.0150 0.5486 2.9939
Random effects:
Groups Name Variance Std.Dev. Corr
id (Intercept) 6.986 2.6432
week 0.380 0.6164 -0.06
Residual 1.597 1.2637
Number of obs: 432, groups: id, 48
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 19.35561 0.40387 47.00004 47.93 <2e-16 ***
week 6.20990 0.09204 47.00077 67.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
week -0.132
isSingular(Pfit) # FALSE is good
[1] FALSE
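A natural follow-up (a sketch; this output was not part of the original slides): profile-likelihood confidence intervals for the random-effect standard deviations and the fixed effects.
confint(Pfit) # profile CIs (can be slow); the .sig rows are the random-effect SDs/correlation, .sigma the residual SD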
coefficients(Pfit)
$id
(Intercept) week
1 19.59583 5.813491
2 17.76410 6.721253
3 15.81816 6.531738
4 21.33010 5.436096
5 20.66434 5.283964
6 18.20918 5.664767
7 16.75841 6.250396
8 18.21689 6.040443
9 16.16318 5.473554
10 20.51879 6.212529
11 19.56571 6.274587
12 18.61198 6.203627
13 18.00780 6.005597
14 18.41249 5.722986
15 16.61927 7.288605
16 15.51606 6.127593
17 26.11286 6.276210
18 22.65131 4.832620
19 24.63000 6.313625
20 20.78559 6.013639
21 20.72208 5.660488
22 22.65456 6.107961
23 20.11090 7.050579
24 20.78047 6.170029
25 15.63791 5.769765
26 15.28836 5.896836
27 17.99367 5.520279
28 16.93479 6.214271
29 19.56767 7.283895
30 16.68012 5.278910
31 17.19519 6.515998
32 17.25231 7.369767
33 19.43871 6.178544
34 18.80121 6.997048
35 17.77885 6.729328
36 20.50274 5.938423
37 18.83326 6.324742
38 19.39256 6.975793
39 19.00195 6.523202
40 18.26821 6.307325
41 19.34824 5.120790
42 19.45602 6.108424
43 19.37264 5.903588
44 22.06386 6.262233
45 25.83643 6.310636
46 20.27636 6.583958
47 22.86370 7.363327
48 21.06463 7.121539
attr(,"class")
[1] "coef.mer"
matplot(1:9, matrix(Pigs$weight, ncol=48, nrow=9, byrow=F), type="p",
col=rep(rainbow(48)), pch=20, ylab="Weight", xlab="Week")
for(i in 1:48) { abline(unlist(coefficients(Pfit)$id[i,]), lty=1, lwd=1, col=rainbow(48)[i]);
abline(lm(weight~week,data=Pigs[Pigs$id==i,])$coefficients, lty=2, lwd=1, col=rainbow(48)[i]); }
legend('topleft',c("Individually Estimated","LMER Estimated"), lty=c(2,1),lwd=c(2,2))
plot(1:9, rep(0,9), ylim=c(-1,1), ylab="Weight", xlab="Week", type="n")
for(i in 1:48) { # differences between best fit lines: LMER - individually fit.
abline(unlist(coefficients(Pfit)$id[i,]) - lm(weight~week,data=Pigs[Pigs$id==i,])$coefficients, lty=1, lwd=1, col=rainbow(48)[i]); }
Replace lm() with lmer() and glm() with glmer() (in the lme4 package) to include random effects.
nlme and other packages exist: the lme4 package is largely an extension of nlme that adds some GLM functionality; however, nlme has long had some functionality (e.g., for heteroskedasticity or correlated errors) that is still lacking in lme4.
Terminology: Many names for the same concepts! See some of this terminology at https://statmodeling.stat.columbia.edu/2019/09/18/all-the-names-for-hierarchical-and-multilevel-modeling/
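For reference, a sketch of the pig model refit in nlme (standard lme() syntax; Pfit_nlme is an illustrative name):
library(nlme)
# Random intercept and slope for each pig, as in the lmer() fit above
Pfit_nlme = lme(weight ~ week, random = ~ week | id, data = Pigs)
summary(Pfit_nlme)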
Diagnostics, variable selection, etc. weren't covered. More worked examples can be found at